Performance Characterization and Parallelization of Tesseract Optical Character Recognition on Multicore Architectures

Authors

  • Sunghwan Bae
  • Jialing Zhang
  • Seung Woo Son
Abstract

Optical Character Recognition (OCR) is one of the major topics in computer vision. It is widely used in applications such as digital libraries, automatic banking systems, and mailing services. The Tesseract OCR engine, which we evaluate in this paper, is one of the best-known OCR programs. It was originally developed at Hewlett Packard Labs between 1985 and 1995 and has been maintained by Google since 2006 [1]. Initially, the program was designed to recognize English text only; it has since been enhanced to support other languages as more training models were added [2]. OCR, including Tesseract, is known to be very compute intensive because of the image and mathematical processing required to obtain higher recognition accuracy. While the recognition accuracy of Tesseract OCR has improved significantly, its parallelization has not been extensively studied. Given the wide availability of multicore architectures, it is highly desirable to achieve better performance by exploiting them. In this paper, we first characterize performance using a profiling tool to find a parallelization target, and then parallelize the identified target using an appropriate parallel programming method. The main goal of this paper is to characterize the performance of the Tesseract OCR program and to accelerate its compute-intensive loops on multicore architectures. Our main contributions are as follows. First, we analyze the Tesseract OCR program using a binary instrumentation tool and identify target loops that can be parallelized. This allows us to decide which parallelization method to apply and how to modify the loops so that they run in parallel. Second, we apply parallel methods to the loops based on the characterization analysis. Lastly, we discuss issues and limitations in parallelizing Tesseract OCR and suggest appropriate solutions as needed.
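As an illustration of the kind of loop-level parallelization the abstract describes, the following is a minimal sketch, not the authors' method and not Tesseract code. It assumes a hypothetical per-row image operation (`binarize_row`) whose iterations are independent, which is the property that makes a loop a candidate for parallel execution. All function names and the synthetic image are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def binarize_row(row, threshold=128):
    """Hypothetical stand-in for an OCR inner loop: depends only on its own row."""
    return [1 if pixel >= threshold else 0 for pixel in row]


def binarize_serial(image, threshold=128):
    """Baseline: process the rows one after another."""
    return [binarize_row(row, threshold) for row in image]


def binarize_parallel(image, threshold=128, workers=4):
    """Distribute the independent row iterations across a pool of workers.

    Executor.map() returns results in input order, so the output matches
    the serial loop row for row.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: binarize_row(row, threshold), image))


# Small synthetic "image" (assumption): 8 rows of 16 grayscale values.
image = [[(r * 37 + c * 11) % 256 for c in range(16)] for r in range(8)]

serial_result = binarize_serial(image)
parallel_result = binarize_parallel(image)
print(parallel_result == serial_result)
```

The sketch only shows the structure of the transformation: confirm that iterations do not depend on each other, then hand them to a worker pool and check the result against the serial version. In CPython, threads do not speed up CPU-bound pure-Python work, so a real speedup would need a process pool or native code; a C++ code base such as Tesseract would more likely use a native threading approach, which the abstract does not name.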


Similar resources

Recognition of Handwritten Roman Script Using Tesseract Open source OCR Engine

In the present work, we have used Tesseract 2.01 open source Optical Character Recognition (OCR) Engine under Apache License 2.0 for recognition of handwriting samples of lower case Roman script. Handwritten isolated and free-flow text samples were collected from multiple users. Tesseract is trained to recognize user-specific handwriting samples of both the categories of document pages. On a si...


Recognition of handwritten Roman Numerals using Tesseract open source OCR engine

The objective of the paper is to recognize handwritten samples of Roman numerals using Tesseract open source Optical Character Recognition (OCR) engine. Tesseract is trained with data samples of different persons to generate one user-independent language model, representing the handwritten Roman digit-set. The system is trained with 1226 digit samples collected from the different users. The per...


Boosting Optical Character Recognition: A Super-Resolution Approach

Text image super-resolution is a challenging yet open research problem in the computer vision community. In particular, low-resolution images hamper the performance of typical optical character recognition (OCR) systems. In this article, we summarize our entry to the ICDAR2015 Competition on Text Image Super-Resolution. Experiments are based on the provided ICDAR2015 TextSR dataset [3] and the ...


Tesseract OCR: A Case Study for License Plate Recognition in Brazil

This paper presents the analysis of Google’s Tesseract OCR for license plate recognition in Brazil. The performance results presented for Tesseract OCR will be compared to market grade OCR products known here as “A” and “B”. This is a necessary measure due to a confidentiality agreement with the company supporting this research. The use of OpenCV is also considered due to limitations inherent t...


Optical Character Recognition by Open source OCR Tool Tesseract: A Case Study

Optical character recognition (OCR) has been used to convert printed text into editable text. OCR is a very useful and popular method in various applications. The accuracy of OCR can depend on text preprocessing and segmentation algorithms. Sometimes it is difficult to retrieve text from an image because of differences in size, style, and orientation, a complex image background, etc. We begin...




Publication date: 2016